Extracts UTF-8 sequence validation to BitString class#717
Extracts UTF-8 sequence validation to BitString class#717mward-sudo wants to merge 4 commits intobartblast:devfrom
Conversation
|
Note Reviews pausedUse the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughAdds four static UTF‑8 helpers to Changes
Sequence Diagram(s)(omitted — changes are internal helper extraction and local refactor; no multi‑component sequential flow warranted) Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Possibly related PRs
Suggested reviewers
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (2)
assets/js/erlang/unicode.mjs (1)
325-340: Four identicalfindValidUtf8Lengthclosures — extract to a shared module-level helperAfter this PR all four normalization functions (
characters_to_nfc_binary/1,characters_to_nfd_binary/1,characters_to_nfkc_binary/1,characters_to_nfkd_binary/1) definefindValidUtf8Lengthwith byte-for-byte identical bodies. Lifting it to a module-level function would eliminate the duplication.♻️ Proposed extraction (to be placed before the `Erlang_Unicode` object)
+// Scans forward to find the longest valid UTF-8 prefix of `bytes`. +// Returns the byte offset past the last valid sequence. +// Time complexity: O(n). +const findValidUtf8Length = (bytes) => { + let pos = 0; + while (pos < bytes.length) { + const seqLength = Bitstring.getUtf8SequenceLength(bytes[pos]); + if (seqLength === false || !Bitstring.isValidUtf8Sequence(bytes, pos, seqLength)) + break; + pos += seqLength; + } + return pos; +}; + const Erlang_Unicode = {Then remove all four local
findValidUtf8Lengthdeclarations and reference the shared one.Also applies to: 579-594, 686-701, 792-806
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@assets/js/erlang/unicode.mjs` around lines 325 - 340, Extract the duplicated closure findValidUtf8Length into a single module-level helper placed before the Erlang_Unicode object and remove the four local definitions inside characters_to_nfc_binary/1, characters_to_nfd_binary/1, characters_to_nfkc_binary/1, and characters_to_nfkd_binary/1 so they reference the shared helper; keep the exact validation logic (use Bitstring.getUtf8SequenceLength and Bitstring.isValidUtf8Sequence) and ensure callers still pass the same bytes array so behavior is unchanged.assets/js/bitstring.mjs (1)
259-259: Object literals allocated on every call in hot-path functions — consider static class fieldsBoth
firstByteMasks(line 259) andminValueForLength(line 602) construct a new object on every invocation. These functions are called per-byte during UTF-8 validation, making them hot paths. Promoting these to private static class fields (arrays, indexed by sequence length) eliminates per-call allocation:♻️ Proposed refactor — static array fields
export default class Bitstring { static `#decoder` = ERTS.utf8Decoder; static `#encoder` = new TextEncoder("utf-8"); + // Indexed by UTF-8 sequence length (1–4); unused positions are 0. + static `#UTF8_FIRST_BYTE_MASKS` = [0, 0, 0x1f, 0x0f, 0x07]; + static `#UTF8_MIN_CODE_POINT` = [0, 0, 0x80, 0x800, 0x10000];static decodeUtf8CodePoint(bytes, start, length) { if (length === 1) return bytes[start]; - - // First byte masks: 2-byte=0x1f, 3-byte=0x0f, 4-byte=0x07 - const firstByteMasks = {2: 0x1f, 3: 0x0f, 4: 0x07}; - - let codePoint = bytes[start] & firstByteMasks[length]; + let codePoint = bytes[start] & $.#UTF8_FIRST_BYTE_MASKS[length];static isValidUtf8CodePoint(codePoint, encodingLength) { - const minValueForLength = {1: 0, 2: 0x80, 3: 0x800, 4: 0x10000}; - - if (codePoint < minValueForLength[encodingLength]) return false; + if (codePoint < $.#UTF8_MIN_CODE_POINT[encodingLength]) return false;Also applies to: 602-602
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@assets/js/bitstring.mjs` at line 259, The object literals firstByteMasks and minValueForLength are being reallocated on every hot-path call; move them out of the functions and into private static class fields (e.g., BitString.#firstByteMasks and BitString.#minValueForLength) as arrays indexed by sequence length, then replace uses of the local objects with lookups into these static fields to avoid per-call allocations (update any functions referencing firstByteMasks and minValueForLength to use the static field names).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@assets/js/bitstring.mjs`:
- Around line 619-634: isValidUtf8Sequence currently omits verification that
bytes[start] is a valid UTF‑8 leader for the given length; update the function's
doc comment (above static isValidUtf8Sequence) to explicitly state the
precondition that the leader byte must be pre-validated (e.g., by
getUtf8SequenceLength) and that this routine only checks bounds, continuation
bytes via $.isValidUtf8ContinuationByte and codepoint validity via
$.decodeUtf8CodePoint/$.isValidUtf8CodePoint; alternatively, if you prefer full
self‑validation, add an explicit leader‑byte check against the expected leader
bit pattern for the provided length before the continuation checks.
In `@assets/js/erlang/unicode.mjs`:
- Line 687: In characters_to_nfkc_binary/1 update the inline comment "// scan
forward, validating each sequence" to match the capitalization used in the other
findValidUtf8Length blocks (e.g. change to "// Scan forward, validating each
sequence") so the comment casing is consistent with the other occurrences in the
findValidUtf8Length implementations.
---
Nitpick comments:
In `@assets/js/bitstring.mjs`:
- Line 259: The object literals firstByteMasks and minValueForLength are being
reallocated on every hot-path call; move them out of the functions and into
private static class fields (e.g., BitString.#firstByteMasks and
BitString.#minValueForLength) as arrays indexed by sequence length, then replace
uses of the local objects with lookups into these static fields to avoid
per-call allocations (update any functions referencing firstByteMasks and
minValueForLength to use the static field names).
In `@assets/js/erlang/unicode.mjs`:
- Around line 325-340: Extract the duplicated closure findValidUtf8Length into a
single module-level helper placed before the Erlang_Unicode object and remove
the four local definitions inside characters_to_nfc_binary/1,
characters_to_nfd_binary/1, characters_to_nfkc_binary/1, and
characters_to_nfkd_binary/1 so they reference the shared helper; keep the exact
validation logic (use Bitstring.getUtf8SequenceLength and
Bitstring.isValidUtf8Sequence) and ensure callers still pass the same bytes
array so behavior is unchanged.
073775f to
b9bfe2d
Compare
There was a problem hiding this comment.
🧹 Nitpick comments (2)
assets/js/bitstring.mjs (1)
255-269: Per-call object literal in a hot-path inner function.
firstByteMasksis re-allocated on every invocation ofdecodeUtf8CodePoint, which is called from within the tightwhileloop infindValidUtf8Lengthacross every unicode function. Hoisting it (orisValidUtf8CodePoint'sminValueForLength, see below) to a module-level constant eliminates per-call heap allocation.♻️ Proposed refactor: hoist lookup table to module scope
+// Pre-allocated lookup tables for UTF-8 decoding/validation (avoid per-call allocation) +const _UTF8_FIRST_BYTE_MASKS = [0, 0xff, 0x1f, 0x0f, 0x07]; +const _UTF8_MIN_CODE_POINT = [0, 0, 0x80, 0x800, 0x10000]; static decodeUtf8CodePoint(bytes, start, length) { if (length === 1) return bytes[start]; - // First byte masks: 2-byte=0x1f, 3-byte=0x0f, 4-byte=0x07 - const firstByteMasks = {2: 0x1f, 3: 0x0f, 4: 0x07}; - - let codePoint = bytes[start] & firstByteMasks[length]; + let codePoint = bytes[start] & _UTF8_FIRST_BYTE_MASKS[length]; for (let i = 1; i < length; i++) { codePoint = (codePoint << 6) | (bytes[start + i] & 0x3f); } return codePoint; }static isValidUtf8CodePoint(codePoint, encodingLength) { - const minValueForLength = {1: 0, 2: 0x80, 3: 0x800, 4: 0x10000}; - - if (codePoint < minValueForLength[encodingLength]) return false; + if (codePoint < _UTF8_MIN_CODE_POINT[encodingLength]) return false; if (codePoint >= 0xd800 && codePoint <= 0xdfff) return false; if (codePoint > 0x10ffff) return false; return true; }Place the two
constdeclarations at the top of the module (outside the class), alongside the existing private class fields area, to keep them co-located with the methods that use them.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@assets/js/bitstring.mjs` around lines 255 - 269, Hoist the per-call lookup tables into module-level constants to avoid allocating them on every decodeUtf8CodePoint invocation: move the object firstByteMasks (used in decodeUtf8CodePoint) — and likewise minValueForLength used by isValidUtf8CodePoint if present — out of the function and define them once at module scope, then update decodeUtf8CodePoint and isValidUtf8CodePoint to reference those constants instead of creating new objects each call.assets/js/erlang/unicode.mjs (1)
326-339: Four identicalfindValidUtf8Lengthbodies — consider extracting toBitstring.Now that the body is entirely composed of
Bitstringstatic calls, the implementations incharacters_to_nfc_binary/1,characters_to_nfd_binary/1,characters_to_nfkc_binary/1, andcharacters_to_nfkd_binary/1are byte-for-byte identical. A singleBitstring.findValidUtf8PrefixLength(bytes)helper would eliminate the duplication and be consistent with the PR's overall goal of centralising UTF-8 logic.♻️ Sketch of the extraction
In
assets/js/bitstring.mjs:+ // Scans bytes forward and returns the length of the longest valid UTF-8 prefix. + static findValidUtf8PrefixLength(bytes) { + let pos = 0; + while (pos < bytes.length) { + const seqLength = $.getUtf8SequenceLength(bytes[pos]); + if (seqLength === false || !$.isValidUtf8Sequence(bytes, pos, seqLength)) + break; + pos += seqLength; + } + return pos; + }Then each
findValidUtf8Lengthlocal inside the NFC/NFD/NFKC/NFKD functions collapses to:- const findValidUtf8Length = (bytes) => { - // Scan forward, validating each sequence - let pos = 0; - while (pos < bytes.length) { - const seqLength = Bitstring.getUtf8SequenceLength(bytes[pos]); - if ( - seqLength === false || - !Bitstring.isValidUtf8Sequence(bytes, pos, seqLength) - ) - break; - pos += seqLength; - } - return pos; - }; + const findValidUtf8Length = Bitstring.findValidUtf8PrefixLength;Also applies to: 583-594, 690-701, 795-806
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@assets/js/erlang/unicode.mjs` around lines 326 - 339, Extract the repeated UTF-8 prefix scanning logic into a new Bitstring.findValidUtf8PrefixLength(bytes) helper that loops using Bitstring.getUtf8SequenceLength(byte) and Bitstring.isValidUtf8Sequence(bytes, pos, seqLength) and returns the valid prefix length; add this helper to Bitstring (e.g., assets/js/bitstring.mjs) and replace the inline loops found in characters_to_nfc_binary/1, characters_to_nfd_binary/1, characters_to_nfkc_binary/1, and characters_to_nfkd_binary/1 (and the other occurrences around the indicated ranges) with a single call to Bitstring.findValidUtf8PrefixLength(bytes).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@assets/js/erlang/unicode.mjs`:
- Line 687: Normalize the comment casing for consistency by changing the inline
comment "// scan forward, validating each sequence" to use sentence-style
capitalization (e.g., "// Scan forward, validating each sequence") wherever this
pattern appears in the file; update the specific comment near the Unicode
validation logic in unicode.mjs (the comment currently reading "scan forward,
validating each sequence") so it matches the project's comment casing standard.
---
Nitpick comments:
In `@assets/js/bitstring.mjs`:
- Around line 255-269: Hoist the per-call lookup tables into module-level
constants to avoid allocating them on every decodeUtf8CodePoint invocation: move
the object firstByteMasks (used in decodeUtf8CodePoint) — and likewise
minValueForLength used by isValidUtf8CodePoint if present — out of the function
and define them once at module scope, then update decodeUtf8CodePoint and
isValidUtf8CodePoint to reference those constants instead of creating new
objects each call.
In `@assets/js/erlang/unicode.mjs`:
- Around line 326-339: Extract the repeated UTF-8 prefix scanning logic into a
new Bitstring.findValidUtf8PrefixLength(bytes) helper that loops using
Bitstring.getUtf8SequenceLength(byte) and Bitstring.isValidUtf8Sequence(bytes,
pos, seqLength) and returns the valid prefix length; add this helper to
Bitstring (e.g., assets/js/bitstring.mjs) and replace the inline loops found in
characters_to_nfc_binary/1, characters_to_nfd_binary/1,
characters_to_nfkc_binary/1, and characters_to_nfkd_binary/1 (and the other
occurrences around the indicated ranges) with a single call to
Bitstring.findValidUtf8PrefixLength(bytes).
|
@coderabbitai Don't do any further reviews on this PR while it is in draft. |
|
✅ Actions performedReviews paused. |
Closes #713
Dependencies
Please note that this PR includes commits from the PR(s) it is dependent upon. Once the dependent PR(s) are merged to the dev branch, then this PR will be rebased and will then only contain its own commits. This PR will remain in draft until that point.
Summary by CodeRabbit
New Features
Refactor
Tests